For Homework 2, I’ve been requested to analyze a time series and forecast the value of the next month of the related dataset with my own effort.
In the explanation below, I’ve tried to apply several linear models to forecast the next month’s value by utilizing several types of variables.
The dataset that I am given is Firm Statistics-Newly Established Total in Numbers in the Production Statistics class which resides in EVDS. I’ve selected data from January 2010 to March 2021 to have a sufficient amount of data.
The data and its smoothed version by geom_smooth is in the plot below.
options(repr.plot.width=30, repr.plot.height=18)
ggplot(dataframe, aes(x=Date,y=NET_Number)) +
geom_line(data=dataframe, aes(x=Date, y=NET_Number, colour = "real")) +
geom_smooth(se=FALSE, colour = "blue") +
labs(colour = "Data Types", title = "Newly Established Totals as Numbers", y="NET Number") +
theme(legend.text = element_text(size = 16),
axis.title.y = element_text(size = 19),
axis.title.x = element_text(size = 19),
axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 15),
plot.title = element_text(size = 22, hjust = 0.5))
As we can see, there’s a decline until mid 2012, then a trend of increase occurs until today, with several exceptions. We can see the structure of the initial data and the processed version.
head(main_data$items, n=10)
## Tarih TP_AC2_TOP_A $numberLong
## 1 2010-1 5000 1262300400
## 2 2010-2 4230 1264978800
## 3 2010-3 5055 1267398000
## 4 2010-4 4466 1270072800
## 5 2010-5 4072 1272664800
## 6 2010-6 4327 1275343200
## 7 2010-7 3979 1277935200
## 8 2010-8 4369 1280613600
## 9 2010-9 3848 1283292000
## 10 2010-10 4284 1285884000
tail(main_data$items, n=10)
## Tarih TP_AC2_TOP_A $numberLong
## 126 2020-6 9719 1590962400
## 127 2020-7 10394 1593554400
## 128 2020-8 9496 1596232800
## 129 2020-9 10764 1598911200
## 130 2020-10 10419 1601503200
## 131 2020-11 8782 1604185200
## 132 2020-12 8560 1606777200
## 133 2021-1 11428 1609455600
## 134 2021-2 10001 1612134000
## 135 2021-3 11034 1614553200
head(dataframe, n=10)
## Date NET_Number
## 1 2010-01-01 5000
## 2 2010-02-01 4230
## 3 2010-03-01 5055
## 4 2010-04-01 4466
## 5 2010-05-01 4072
## 6 2010-06-01 4327
## 7 2010-07-01 3979
## 8 2010-08-01 4369
## 9 2010-09-01 3848
## 10 2010-10-01 4284
tail(dataframe, n=10)
## Date NET_Number
## 126 2020-06-01 9719
## 127 2020-07-01 10394
## 128 2020-08-01 9496
## 129 2020-09-01 10764
## 130 2020-10-01 10419
## 131 2020-11-01 8782
## 132 2020-12-01 8560
## 133 2021-01-01 11428
## 134 2021-02-01 10001
## 135 2021-03-01 11034
After date process, we’ve two columns, namely date and the NET_Number. NET_Number represents newly established firm total as number and for the rest of the analysis, I will try to forecast NET_Number for April 2021.
At first, I convert the dataframe to datatable to manipulate columns easily. Then, I add trend and month variables to use in further models.
datatable = data.table(dataframe)
datatable[, log_NET_Number:=log(dataframe[2])]
datatable[, trend:=1:.N]
month = seq(1, 12, by=1)
datatable = cbind(datatable, month)
Now, the preparation process is done. Let’s move forward to the model composing stage.
For my first model, I use only the trend variable in my datatable and inspect the results of the model with summaries, plottings, and residual analysis.
I use checkresiduals function with parameter lag set to 12, since its default value is 10, but I am using a monthly data thus it should be 12. Also, all of my models anaylze the logarithmic form of the data to simplify the results.
fit1 <- lm(log_NET_Number~trend, data=datatable)
summary(fit1)
##
## Call:
## lm(formula = log_NET_Number ~ trend, data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97583 -0.08758 0.01943 0.13991 0.47072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.1620562 0.0398989 204.57 <2e-16 ***
## trend 0.0062792 0.0005091 12.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2305 on 133 degrees of freedom
## Multiple R-squared: 0.5336, Adjusted R-squared: 0.5301
## F-statistic: 152.1 on 1 and 133 DF, p-value: < 2.2e-16
plot(fit1)
checkresiduals(fit1, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 45.166, df = 12, p-value = 9.652e-06
As we can see, trend variable has a significant effect on the model, yet the model performed approximately 53%, which can be improved. The residuals show a clustering approach, thus it seems that there’s information in the model that needs to be extracted. Variance changes at specific points in the dataset, thus let’s keep this in mind for further models. Also, data shows partially normal distribution, and autocorrelative features reside in the current form of the model.
Now, let’s create another linear model with the trend and month variable to extract month information.
fit2 <- lm(log_NET_Number~trend+as.factor(month), data = datatable)
summary(fit2)
##
## Call:
## lm(formula = log_NET_Number ~ trend + as.factor(month), data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94787 -0.09270 0.02377 0.12567 0.40100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.4150377 0.0689173 122.103 < 2e-16 ***
## trend 0.0062728 0.0004698 13.351 < 2e-16 ***
## as.factor(month)2 -0.1729170 0.0867043 -1.994 0.048346 *
## as.factor(month)3 -0.1218430 0.0867081 -1.405 0.162500
## as.factor(month)4 -0.2801404 0.0886629 -3.160 0.001992 **
## as.factor(month)5 -0.3092730 0.0886566 -3.488 0.000676 ***
## as.factor(month)6 -0.2887088 0.0886529 -3.257 0.001460 **
## as.factor(month)7 -0.3616077 0.0886517 -4.079 8.10e-05 ***
## as.factor(month)8 -0.3985385 0.0886529 -4.495 1.59e-05 ***
## as.factor(month)9 -0.3412691 0.0886566 -3.849 0.000190 ***
## as.factor(month)10 -0.2624056 0.0886629 -2.960 0.003702 **
## as.factor(month)11 -0.2751215 0.0886716 -3.103 0.002383 **
## as.factor(month)12 -0.2607826 0.0886828 -2.941 0.003920 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2124 on 122 degrees of freedom
## Multiple R-squared: 0.6368, Adjusted R-squared: 0.6011
## F-statistic: 17.82 on 12 and 122 DF, p-value: < 2.2e-16
plot(fit2)
checkresiduals(fit2, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 42.486, df = 12, p-value = 2.759e-05
In this model, we’ve retrieved 60% performance, which is better than the previous one. Autocorrelative features are extracted for larger lag values, yet there is information residing in lag1 and lag2 state of the data, since p-value of the Breusch-Godfrey test can reject the null hypothesis. Variance did not change significantly, thus let’s try another model with a different approach.
In this model, I’ve added another data set related with our primary one to see the differences between the two data set. My second dataset is Firm Statistics-Liquidated Total in Numbers, starting from January 2010 until March 2021.
supplementary_data <- get_series(series = c("TP.KAP2.TOP.A"), start_date = "01-01-2010", end_date = "01-03-2021")
datatable[, LT_Number:=as.numeric(supplementary_data$items$TP_KAP2_TOP_A)]
datatable[, log_LT_Number:=log(datatable[,"LT_Number"])]
options(repr.plot.width=30, repr.plot.height=18)
ggplot(data=datatable, aes(x=Date,y=LT_Number)) +
geom_smooth(se=FALSE, colour = "blue") +
geom_line(data=datatable, aes(x=Date, y=LT_Number, colour = "real")) +
labs(colour = "Data Types", title = "Liquidated Totals as Numbers", y="LT Number") +
theme(legend.text = element_text(size = 16),
axis.title.y = element_text(size = 19),
axis.title.x = element_text(size = 19),
axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 15),
plot.title = element_text(size = 22, hjust = 0.5))
fit3 <- lm(diff(log_NET_Number)~diff(log_LT_Number), data = datatable)
summary(fit3)
##
## Call:
## lm(formula = diff(log_NET_Number) ~ diff(log_LT_Number), data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93529 -0.10642 -0.00576 0.10768 0.84596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006987 0.017467 0.4 0.69
## diff(log_LT_Number) 0.223025 0.042083 5.3 4.73e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2022 on 132 degrees of freedom
## Multiple R-squared: 0.1754, Adjusted R-squared: 0.1692
## F-statistic: 28.09 on 1 and 132 DF, p-value: 4.734e-07
plot(fit3)
checkresiduals(fit3, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 51.603, df = 12, p-value = 7.288e-07
We can see that the model performed only 16%, which is quite bad. Yet, we can see that the variance of the residuals are minimal and residuals are in normal distribution with mean almost 0. It seems that this data can be used with other variables to complete the model.
Now, I’ve added the pandemic status for firms that has affected them. In March, April, and May 2020, there’s been a lockdown which ultimately affects the newly established firms since face-to-face interactions are restricted. Hence, my next model includes the factor of PandemicClosure variable, which indicates the pandemic status.
datatable[, PandemicClosure:=c(rep(0, 122), rep(1,3), rep(0,10))]
fit6 <- lm(log_NET_Number~trend+as.factor(month)+as.factor(PandemicClosure), data = datatable)
summary(fit6)
##
## Call:
## lm(formula = log_NET_Number ~ trend + as.factor(month) + as.factor(PandemicClosure),
## data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50503 -0.09263 0.00297 0.12725 0.54947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.3773425 0.0620618 134.984 < 2e-16 ***
## trend 0.0068354 0.0004325 15.806 < 2e-16 ***
## as.factor(month)2 -0.1734796 0.0776171 -2.235 0.027248 *
## as.factor(month)3 -0.0679096 0.0782179 -0.868 0.386998
## as.factor(month)4 -0.2183886 0.0801356 -2.725 0.007379 **
## as.factor(month)5 -0.2480838 0.0801162 -3.097 0.002434 **
## as.factor(month)6 -0.2881461 0.0793614 -3.631 0.000415 ***
## as.factor(month)7 -0.3616077 0.0793603 -4.557 1.25e-05 ***
## as.factor(month)8 -0.3991011 0.0793614 -5.029 1.73e-06 ***
## as.factor(month)9 -0.3423943 0.0793650 -4.314 3.29e-05 ***
## as.factor(month)10 -0.2640934 0.0793709 -3.327 0.001162 **
## as.factor(month)11 -0.2773719 0.0793791 -3.494 0.000665 ***
## as.factor(month)12 -0.2635956 0.0793897 -3.320 0.001189 **
## as.factor(PandemicClosure)1 -0.6607038 0.1182101 -5.589 1.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1901 on 121 degrees of freedom
## Multiple R-squared: 0.7113, Adjusted R-squared: 0.6803
## F-statistic: 22.93 on 13 and 121 DF, p-value: < 2.2e-16
plot(fit6)
checkresiduals(fit6, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 60.144, df = 12, p-value = 2.125e-08
We can see that model performance has increased to 68%, which is 8% more than the previous maximum performance. Yet, the variance became changeable and autocorrelative features are not yet diminished. Residuals still show normal distribution, which holds one of the conditions.
In the next model, I introduce the lag1 variables for log_NET_Number and log_LT_Number to increse the model performance by extracting autoregressive features.
In the first model, only lag1 of log_NET_Number is included. In the second model, only lag1 of log_LT_Number is included. In the third one, both of the lag1 variables are included.
lagged_log_NET_Number <- datatable$log_NET_Number[2:nrow(datatable)]
lagged_log_NET_Number <- c(lagged_log_NET_Number, rep(lagged_log_NET_Number[1], 1))
datatable[, lagged_log_NET_Number:=lagged_log_NET_Number]
lagged_log_LT_Number <- datatable$log_LT_Number[2:nrow(datatable)]
lagged_log_LT_Number <- c(lagged_log_LT_Number, rep(lagged_log_LT_Number[1], 1))
datatable[, lagged_log_LT_Number:=lagged_log_LT_Number]
fit7 <- lm(log_NET_Number~trend+as.factor(month)+as.factor(PandemicClosure)+lagged_log_NET_Number, data = datatable)
summary(fit7)
##
## Call:
## lm(formula = log_NET_Number ~ trend + as.factor(month) + as.factor(PandemicClosure) +
## lagged_log_NET_Number, data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46707 -0.09669 0.01529 0.09737 0.71608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2326913 0.6310900 8.292 1.87e-13 ***
## trend 0.0043367 0.0006368 6.810 4.12e-10 ***
## as.factor(month)2 -0.1928915 0.0710017 -2.717 0.007568 **
## as.factor(month)3 -0.0228553 0.0720097 -0.317 0.751498
## as.factor(month)4 -0.1845226 0.0735083 -2.510 0.013397 *
## as.factor(month)5 -0.2219727 0.0733642 -3.026 0.003036 **
## as.factor(month)6 -0.2161545 0.0739033 -2.925 0.004122 **
## as.factor(month)7 -0.2754038 0.0745075 -3.696 0.000331 ***
## as.factor(month)8 -0.3346763 0.0736238 -4.546 1.32e-05 ***
## as.factor(month)9 -0.3079989 0.0728174 -4.230 4.60e-05 ***
## as.factor(month)10 -0.2247376 0.0729230 -3.082 0.002552 **
## as.factor(month)11 -0.2433925 0.0728225 -3.342 0.001109 **
## as.factor(month)12 -0.3258218 0.0735736 -4.429 2.11e-05 ***
## as.factor(PandemicClosure)1 -0.4635207 0.1149416 -4.033 9.73e-05 ***
## lagged_log_NET_Number 0.3820732 0.0763671 5.003 1.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1737 on 120 degrees of freedom
## Multiple R-squared: 0.7611, Adjusted R-squared: 0.7333
## F-statistic: 27.31 on 14 and 120 DF, p-value: < 2.2e-16
plot(fit7)
checkresiduals(fit7, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 34.962, df = 12, p-value = 0.0004749
In the first model, we’ve retrieved 73.33% performance, which is better than the previous version. Residuals seem to ungroup and variance become lesser than before. Autocorrelative features get diminished significantly and we’ve the highest p-value for Breusch-Godfrey test, yet we cannot reject the null hypothesis.
fit8 <- lm(log_NET_Number~trend+as.factor(month)+as.factor(PandemicClosure)+lagged_log_LT_Number, data = datatable)
summary(fit8)
##
## Call:
## lm(formula = log_NET_Number ~ trend + as.factor(month) + as.factor(PandemicClosure) +
## lagged_log_LT_Number, data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52678 -0.09558 0.00556 0.12099 0.52949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8192763 0.5571153 15.830 < 2e-16 ***
## trend 0.0067823 0.0004382 15.478 < 2e-16 ***
## as.factor(month)2 -0.1753126 0.0777676 -2.254 0.02599 *
## as.factor(month)3 -0.0723087 0.0785291 -0.921 0.35901
## as.factor(month)4 -0.2225853 0.0804280 -2.768 0.00654 **
## as.factor(month)5 -0.2377027 0.0812837 -2.924 0.00413 **
## as.factor(month)6 -0.2737120 0.0815117 -3.358 0.00105 **
## as.factor(month)7 -0.3579051 0.0796148 -4.495 1.61e-05 ***
## as.factor(month)8 -0.3924876 0.0799114 -4.912 2.89e-06 ***
## as.factor(month)9 -0.3312696 0.0806968 -4.105 7.41e-05 ***
## as.factor(month)10 -0.2560933 0.0801194 -3.196 0.00178 **
## as.factor(month)11 -0.2188248 0.1081645 -2.023 0.04529 *
## as.factor(month)12 -0.2067585 0.1067314 -1.937 0.05507 .
## as.factor(PandemicClosure)1 -0.6827602 0.1215696 -5.616 1.28e-07 ***
## lagged_log_LT_Number -0.0642193 0.0804514 -0.798 0.42631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1904 on 120 degrees of freedom
## Multiple R-squared: 0.7128, Adjusted R-squared: 0.6793
## F-statistic: 21.28 on 14 and 120 DF, p-value: < 2.2e-16
plot(fit8)
checkresiduals(fit7, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 34.962, df = 12, p-value = 0.0004749
In the second model, we’ve retrieved 67.93% performance, which is worse than the first version. No significant change in residuals, their distribution, and autocorrelative features.
fit9 <- lm(log_NET_Number~trend+as.factor(month)+as.factor(PandemicClosure)+lagged_log_NET_Number+lagged_log_LT_Number, data = datatable)
summary(fit9)
##
## Call:
## lm(formula = log_NET_Number ~ trend + as.factor(month) + as.factor(PandemicClosure) +
## lagged_log_NET_Number + lagged_log_LT_Number, data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51244 -0.09653 0.01229 0.10256 0.68422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9821648 0.7290957 8.205 3.11e-13 ***
## trend 0.0040069 0.0006507 6.158 1.03e-08 ***
## as.factor(month)2 -0.1987049 0.0702107 -2.830 0.005464 **
## as.factor(month)3 -0.0291695 0.0712166 -0.410 0.682844
## as.factor(month)4 -0.1913060 0.0727065 -2.631 0.009634 **
## as.factor(month)5 -0.1960353 0.0736541 -2.662 0.008852 **
## as.factor(month)6 -0.1771164 0.0756230 -2.342 0.020837 *
## as.factor(month)7 -0.2597447 0.0740354 -3.508 0.000637 ***
## as.factor(month)8 -0.3141694 0.0734713 -4.276 3.86e-05 ***
## as.factor(month)9 -0.2796691 0.0733478 -3.813 0.000219 ***
## as.factor(month)10 -0.2031462 0.0728657 -2.788 0.006177 **
## as.factor(month)11 -0.1065516 0.0996861 -1.069 0.287291
## as.factor(month)12 -0.2009127 0.0961802 -2.089 0.038847 *
## as.factor(PandemicClosure)1 -0.4975727 0.1148526 -4.332 3.10e-05 ***
## lagged_log_NET_Number 0.4139150 0.0771398 5.366 4.03e-07 ***
## lagged_log_LT_Number -0.1469921 0.0741166 -1.983 0.049642 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1716 on 119 degrees of freedom
## Multiple R-squared: 0.7688, Adjusted R-squared: 0.7396
## F-statistic: 26.38 on 15 and 119 DF, p-value: < 2.2e-16
plot(fit9)
checkresiduals(fit7, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 34.962, df = 12, p-value = 0.0004749
In the third model, we’ve retrieved 73.96% performance, a very slight improvement compared to the first model. Also, no significant change in residuals, their distribution, and autocorrelative features.
Now, let’s add the lag0 data of the liquidated total in numbers to increase the model performance, since we now that liquidated total in numbers have resemblance with the newly established total in numbers because variance of their difference was lower than all previous models, and the distribution of the residuals seem to comply with the normal distribution.
fit10 <- lm(log_NET_Number~trend+as.factor(month)+as.factor(PandemicClosure)+log_LT_Number+lagged_log_NET_Number+lagged_log_LT_Number, data = datatable)
summary(fit10)
##
## Call:
## lm(formula = log_NET_Number ~ trend + as.factor(month) + as.factor(PandemicClosure) +
## log_LT_Number + lagged_log_NET_Number + lagged_log_LT_Number,
## data = datatable)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41323 -0.07981 0.00040 0.10088 0.53324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7063298 0.8028309 4.617 1.00e-05 ***
## trend 0.0037256 0.0005953 6.258 6.50e-09 ***
## as.factor(month)2 0.1277886 0.0909873 1.404 0.162807
## as.factor(month)3 0.2997990 0.0919808 3.259 0.001459 **
## as.factor(month)4 0.1719763 0.0978359 1.758 0.081374 .
## as.factor(month)5 0.1973701 0.1028722 1.919 0.057451 .
## as.factor(month)6 0.1624014 0.0963035 1.686 0.094371 .
## as.factor(month)7 0.0148902 0.0866676 0.172 0.863882
## as.factor(month)8 0.0294738 0.0954934 0.309 0.758134
## as.factor(month)9 0.0541545 0.0940347 0.576 0.565780
## as.factor(month)10 0.0954249 0.0889265 1.073 0.285426
## as.factor(month)11 0.3559394 0.1290344 2.758 0.006733 **
## as.factor(month)12 -0.0643008 0.0916930 -0.701 0.484520
## as.factor(PandemicClosure)1 -0.3743378 0.1074242 -3.485 0.000692 ***
## log_LT_Number 0.3858988 0.0765008 5.044 1.67e-06 ***
## lagged_log_NET_Number 0.4822042 0.0715524 6.739 6.18e-10 ***
## lagged_log_LT_Number -0.3326156 0.0768845 -4.326 3.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1563 on 118 degrees of freedom
## Multiple R-squared: 0.8098, Adjusted R-squared: 0.784
## F-statistic: 31.4 on 16 and 118 DF, p-value: < 2.2e-16
plot(fit10)
checkresiduals(fit10, lag=12)
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals
## LM test = 44.838, df = 12, p-value = 1.098e-05
We can see that the model performance is now 78.4%, which is the highest of all previous models. Also, residuals seem to moved away from each other. Variance is slightly affected yet considerably small. The distribution of the residuals are complying with the normal distribution and autocorrelative features are sufficiently diminished. This model can be used to predict the next month’s log_NET_Number.
predictions <- predict(fit10, datatable)
datatable[, predictions:=predictions]
options(repr.plot.width=30, repr.plot.height=18)
ggplot(data=datatable, aes(x=Date,y=LT_Number)) +
geom_line(data=datatable, aes(x=Date, y=log_NET_Number, colour = "real")) +
geom_line(data=datatable, aes(x=Date, y=predictions, colour = "pred")) +
scale_colour_manual("", breaks = c("real", "pred"), values = c("red", "purple")) +
labs(colour = "Data Types", title = "Newly Established Totals as Numbers", y="NET Number") +
theme(legend.text = element_text(size = 16),
axis.title.y = element_text(size = 19),
axis.title.x = element_text(size = 19),
axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 15),
plot.title = element_text(size = 22, hjust = 0.5))
We can see that our model shows similar formation with the actual data, but not exactly similar. Thus our prediction for the next month can be close to the actual value.
Now, let’s add the next month’s row for prediction and set the values based on the previous month.
datatable <- rbind(datatable, data.table(Date=as.Date("2021-04-01")), fill=TRUE)
datatable[Date==as.Date("2021-04-01"), trend:=1+datatable[.N-1,trend]]
datatable[Date==as.Date("2021-04-01"), month:=(1+datatable[.N-1, month])%%12]
datatable[Date==as.Date("2021-04-01"), LT_Number:=datatable[.N-1, LT_Number]]
datatable[Date==as.Date("2021-04-01"), log_LT_Number:=datatable[.N-1, log_LT_Number]]
datatable[Date==as.Date("2021-04-01"), lagged_log_NET_Number:=datatable[.N-1, lagged_log_NET_Number]]
datatable[Date==as.Date("2021-04-01"), lagged_log_LT_Number:=datatable[.N-1, lagged_log_LT_Number]]
datatable[Date==as.Date("2021-04-01"), PandemicClosure:=datatable[.N-1,PandemicClosure]]
tail(datatable, n=1)
## Date NET_Number log_NET_Number trend month LT_Number log_LT_Number
## 1: 2021-04-01 NA NA 136 4 1015 6.922644
## PandemicClosure lagged_log_NET_Number lagged_log_LT_Number predictions
## 1: 0 8.349957 6.969791 NA
Then, predict the value of April based on March’s values.
april_prediction <- predict(fit7, datatable[Date==as.Date("2021-04-01")])
datatable[Date==as.Date("2021-04-01"), log_NET_Number:=april_prediction]
datatable[Date==as.Date("2021-04-01"), NET_Number:=round(exp(april_prediction))]
tail(datatable, n=1)
## Date NET_Number log_NET_Number trend month LT_Number log_LT_Number
## 1: 2021-04-01 6824 8.828251 136 4 1015 6.922644
## PandemicClosure lagged_log_NET_Number lagged_log_LT_Number predictions
## 1: 0 8.349957 6.969791 NA
My model finds the newly established firms total in numbers as 6824. Since the actual value is not published in the website, I cannot compare my value with its real counterpart. Therefore, my analysis of time series regression is hereby finished.
In this task, I’ve learned more R syntax and libraries with their useful functions as practical benefits and the assumptions of residuals and how to reform model to comply with the assumptions. Thank you for your time to read until here.